Abstract. A brief introduction is given to some aspects of the statistical description 
of the luminous matter distribution. I review the features of the redshift surveys 
that arise in the statistical analysis of the galaxy clustering. Special topics include 
intensity functions, correlation functions, correlation integrals, multifractals and 
multiscaling. 



1. Introduction 

I am sure that, in spite of the title of the School, at the end of these two weeks we 
shall have a less dark picture of the three-dimensional distribution of matter in the 
Universe. There is still a huge number of unsolved problems regarding the origin and 
evolution of the observed large-scale structure in the Universe. Although important 
developments have occurred during the last two decades, the task has proved so 
elusive that most of the students of this School will have interesting research projects 
on these topics in the following years. 

The statistical study of the clustering patterns formed by the three-dimensional 
distribution of galaxies provides one of the most important observational clues to 
the physical processes that led to the large-scale structure of the Universe. A detailed 
statistical description of the observed distribution of matter in the Universe is needed 
to confront theoretical predictions of models of structure formation, such as N-body 
simulations involving dark matter, with observations. 

The aim of this lecture is to introduce some statistical aspects of the description 
of the clustering in the Universe. This introduction will be followed by more detailed 
lectures given by Drs. Borgani and Coles. 



2. Surveys of galaxy redshifts 



In the 1980s different groups of astronomers started systematic observational programs 
to construct a true three-dimensional fair sample of the Universe. The task consists 
in measuring the location in space of the galaxies lying in the studied region of the sky. 
In addition to its angular position, we need to know the distance to each object. This 
is usually done by measuring the redshift $z$ in the spectrum of each galaxy, which is 
related to the line-of-sight recession velocity, $v_{\rm rec} = cz$. The Hubble law allows us to 
estimate the distance $d$ to each galaxy, as $v_{\rm rec} \simeq v_H = Hd$ (but see sec. 2.3), and 
therefore to build a three-dimensional map of the Universe. 
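
As a rough illustration of the distance estimate just described, the following Python sketch converts a redshift into a Hubble-law distance. It assumes the low-redshift linear relation $v_{\rm rec} = cz$ quoted above and ignores peculiar velocities; the function name `hubble_distance` and the parameter `h` are illustrative choices made here, not part of the lecture.

```python
C_KM_S = 299_792.458  # speed of light in km/s


def hubble_distance(z, h=1.0):
    """Distance in Mpc from d = c z / H, with H = 100 h km/s/Mpc.

    Only meaningful for z << 1, where v_rec = c z is a fair approximation.
    """
    hubble_constant = 100.0 * h  # km/s/Mpc
    return C_KM_S * z / hubble_constant


# Distances for a few nearby redshifts, quoted for h = 1 (i.e. in h^-1 Mpc).
for z in (0.005, 0.01, 0.03):
    print(f"z = {z:5.3f}  ->  d = {hubble_distance(z):7.2f} h^-1 Mpc")
```

With `h = 1` the result can be read directly in units of $h^{-1}$ Mpc, which is the convention used in the rest of the text.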

We denote by $\{x_i\}_{i=1}^{N}$ the positions of the $N$ galaxies in a portion of the universe 
with volume $V$. Techniques of point-field statistics may be used to describe the 
statistical features of the distribution of such galaxies. Some words of caution should 
be given before embarking on the different techniques of statistical analysis. 
Catalogues of galaxies are not simple point samples, as are many of the planar 
point processes usually studied in the literature of spatial statistics (like the positions 
of trees in forests). To handle galaxy samples properly, we need to bear in mind some 
of the characteristics of their construction [?]. 

2.1. Galaxy obscuration 

The extragalactic optical light does not reach the Earth uniformly from all directions. 
The plane of the Milky Way, our own Galaxy, is filled with interstellar dust which 
absorbs most of the light coming from extragalactic sources. Therefore catalogues 
are incomplete at galactic latitudes $|b| < 20°$-$30°$. This fact implies that the 
band of the sky corresponding to low galactic latitudes is usually not considered 
in the optical samples analyzed. The geometry of the three-dimensional regions 
often becomes irregular because of this and other observational constraints. For example, 
the analysis of the CfA-I [?] redshift survey is usually performed in the Northern 
Galactic hemisphere (with $b > 40°$) and at equatorial declination $\delta > 0°$. Another kind of 
obscuration, related to the observational device, is the mask of the IRAS satellite 
[?]. The QDOT-IRAS redshift survey covers 74% of the sky after removal of the 
masked regions and the band corresponding to galactic latitude $|b| < 10°$ (IRAS observes 
in the infrared and safely identifies galaxies closer to the Galactic plane than optically 
observed galaxies). 

The brightness of the galaxies is also affected by the galactic absorption. This 
effect is usually modelled by the cosecant law in latitude, $\Delta m = A\,{\rm cosec}\,b$. 
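
The cosecant law can be sketched in a few lines of Python. The amplitude `A` depends on the survey and passband and is not specified in the text, so the default below is only a placeholder; taking the absolute value of $b$, so that both Galactic hemispheres are treated alike, is also an assumption of this sketch.

```python
import math


def galactic_extinction(b_deg, A=0.1):
    """Magnitude correction Delta m = A cosec|b| for galactic latitude b (degrees).

    A = 0.1 is an illustrative placeholder, not a value taken from the text.
    """
    b = math.radians(abs(b_deg))
    if b == 0.0:
        raise ValueError("the cosecant law diverges on the Galactic plane (b = 0)")
    return A / math.sin(b)


# The correction grows quickly towards the Galactic plane.
for b_deg in (90, 40, 20, 10):
    print(f"b = {b_deg:2d} deg  ->  Delta m = {galactic_extinction(b_deg):.3f}")
```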

2.2. Brightness and apparent magnitude limit 

Galaxies in a redshift survey have different intrinsic brightness. Most of the catalogues 
are built by fixing an apparent magnitude limit $m_{\rm lim}$. Therefore galaxies with 
$m > m_{\rm lim}$ are not seen by the telescope or are not considered because of observational 
strategies. An apparent magnitude-limited sample is therefore not uniform in space, 
as intrinsically faint objects are only seen if they are close enough to the Earth. To 
analyze this kind of flux-limited sample we can follow two strategies: 

(i) One can extract volume-limited samples by fixing a value of the depth $D_{\max}$ (in 
$h^{-1}$ Mpc, $h$ being the Hubble constant in units of 100 km s$^{-1}$ Mpc$^{-1}$) and keeping 
only galaxies brighter than 

$$M_{\rm lim} = m_{\rm lim} - 25 - 5\log(D_{\max}). \qquad (1)$$

For example, if the catalogue has an apparent magnitude limit $m_{\rm lim} = 14.5$ and 
we take as the maximum depth of the volume to be studied $D_{\max} = 100\,h^{-1}$ 
Mpc, only galaxies with absolute magnitude $M < -20.5 + 5\log h$ will remain in 
the volume-limited sample. With this strategy, however, we lose part of the 
information provided by the redshift survey. To avoid this problem we can follow 
the second strategy. 

(ii) The second procedure is based on the knowledge of the selection function $\varphi(x)$. 
The function $\varphi(x)$ gives an estimate of the probability that a galaxy brighter 
than a given luminosity cutoff, at a distance $x$, is included in the sample. If the 
sample is complete up to a distance $R$, $\varphi(x) = 1$ for $x < R$. For example, in 
Fig. 1 we show the selection function of one QDOT-IRAS subsample for galaxies 
with $L > 10^{9.19} L_\odot$. With this luminosity limit, the sample is complete up to 
$R = 40\,h^{-1}$ Mpc [?]. Beyond this distance the selection function $\varphi(x)$ falls off 
and attains values close to zero for $x \simeq 200\,h^{-1}$ Mpc. 

The selection function is derived from the luminosity function $\phi(L)$. The 
luminosity function is defined by requiring that the mean number of galaxies 
per unit volume with luminosity in the range $L$ to $L + dL$ is $\phi(L)\,dL$. The 
empirical luminosity function is often fitted by the analytical expression [?] 

$$\phi(L)\,dL = \phi^* \left(\frac{L}{L^*}\right)^{\alpha} \exp\left(-\frac{L}{L^*}\right) d\left(\frac{L}{L^*}\right), \qquad (2)$$

where $L^*$ and $\alpha$ are the fitting parameters, while $\phi^*$ is related to the number 
density of galaxies. The previous expression in terms of magnitudes is 

$$\phi(M)\,dM = A\,\phi^* \left[10^{0.4(M^*-M)}\right]^{\alpha+1} \exp\left[-10^{0.4(M^*-M)}\right] dM, \qquad (3)$$

where $A = \frac{2}{5}\ln(10)$. Therefore, the selection function is just the ratio 

$$\varphi(x) = \frac{\int_{-\infty}^{M(x)} \phi(M)\,dM}{\int_{-\infty}^{M_{\max}} \phi(M)\,dM} = \frac{\Gamma\left(\alpha+1,\,10^{0.4(M^*-M(x))}\right)}{\Gamma\left(\alpha+1,\,10^{0.4(M^*-M_{\max})}\right)}, \qquad (4)$$

where $M(x) = m_{\rm lim} - 25 - 5\log(x)$, $\Gamma$ being now the incomplete Gamma function, 
$M_{\max} = \max(M(x), M_{\rm com})$ and $M_{\rm com}$ is the absolute magnitude for which the 
catalogue is complete. The parameters of the luminosity function depend on the 
sample [?]. Typical values are $\alpha = -1.1$ and $M^* = -19.3 + 5\log h$. 

Within this strategy, we can assign to each galaxy a weight $w = 1/\varphi(x)$ depending 
on its distance $x$ to us. 
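
The two strategies can be sketched numerically as follows. This is a minimal illustration, not the procedure used for any particular survey: the integrals of Eq. (4) are evaluated with a simple Simpson rule instead of incomplete Gamma functions (which avoids library restrictions on a negative first argument, since $\alpha + 1 < 0$ for $\alpha = -1.1$), minus infinity is replaced by a bright cutoff `M_bright`, and the completeness magnitude `M_com = -18.5` is a hypothetical value chosen only so that the sample is complete out to roughly $40\,h^{-1}$ Mpc for $m_{\rm lim} = 14.5$. All magnitudes are understood as $M - 5\log h$.

```python
import math

M_BRIGHT = -26.0  # stand-in for minus infinity; the Schechter tail is negligible there


def abs_mag_limit(m_lim, x):
    """Eq. (1): limiting absolute magnitude at distance x (in h^-1 Mpc)."""
    return m_lim - 25.0 - 5.0 * math.log10(x)


def schechter_mag(M, M_star=-19.3, alpha=-1.1):
    """Eq. (3) without the constant prefactor A phi*, which cancels in Eq. (4)."""
    t = 10.0 ** (0.4 * (M_star - M))
    return t ** (alpha + 1.0) * math.exp(-t)


def integrate_lf(M_upper, steps=2000):
    """Composite Simpson integral of the luminosity function up to M_upper."""
    width = (M_upper - M_BRIGHT) / steps
    total = schechter_mag(M_BRIGHT) + schechter_mag(M_upper)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * schechter_mag(M_BRIGHT + k * width)
    return total * width / 3.0


def selection_function(x, m_lim=14.5, M_com=-18.5):
    """Eq. (4): equals 1 where the sample is complete and decreases beyond."""
    M_x = abs_mag_limit(m_lim, x)
    if M_x <= M_BRIGHT:  # beyond the range covered by this sketch
        return 0.0
    M_max = max(M_x, M_com)
    return integrate_lf(M_x) / integrate_lf(M_max)


def weight(x, **kwargs):
    """Weight w = 1 / phi(x) assigned to a galaxy at distance x."""
    return 1.0 / selection_function(x, **kwargs)


print("M_lim for D_max = 100:", abs_mag_limit(14.5, 100.0))
for x in (20.0, 60.0, 100.0):
    print(f"x = {x:5.1f}  phi = {selection_function(x):.4f}")
```

The weights grow rapidly with distance, so in practice a sample is usually truncated where the selection function becomes very small and the weights very noisy.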



2.3. Redshift-space distortions 

The observed recession velocity $v_{\rm rec}$ is not only due to the Hubble expansion. Other 
components have to be added to $v_H$ to obtain $v_{\rm rec}$. Bulk flows or streaming motions 
on large scales, or local velocities within clusters on small scales, might not be negligible. 
The peculiar velocity is the velocity of a galaxy with respect to the Hubble flow. Let 
us denote its component along the line of sight by $v_{\rm pec}$; then the observed recession 
velocity is 

$$v_{\rm rec} = cz = Hd + v_{\rm pec}, \qquad (5)$$

and therefore we have to distinguish between 'redshift space' and 'real space'; the 
first one is artificially produced by setting each galaxy at the distance $d$ obtained 
by assuming $v_{\rm pec} = 0$, and is therefore a distorted representation of the second one. 
The effect of this radial distortion is clearly illustrated when dense clusters of galaxies, 
almost spherical in real space, appear as structures elongated along the line of sight 
in redshift space. These structures are known as 'fingers of God'. 
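
The elongation can be reproduced with a toy calculation based on Eq. (5). The sketch below places a mock cluster at a fixed distance, draws Gaussian line-of-sight peculiar velocities, and compares the radial scatter in real and redshift space. The cluster distance, its radius and the velocity dispersion of 800 km/s are illustrative assumptions, not values from the text.

```python
import random
import statistics

H = 100.0  # km/s per h^-1 Mpc, so that distances are in h^-1 Mpc


def redshift_space_distance(d, v_pec):
    """Distance inferred from Eq. (5) when v_pec is wrongly assumed to vanish."""
    return d + v_pec / H


def mock_cluster(n=500, d0=50.0, radius=1.0, sigma_v=800.0, seed=1):
    """Radial positions of a toy cluster in real space and in redshift space."""
    rng = random.Random(seed)
    real = [rng.gauss(d0, radius) for _ in range(n)]
    red = [redshift_space_distance(d, rng.gauss(0.0, sigma_v)) for d in real]
    return real, red


real, red = mock_cluster()
print("radial scatter, real space     :", round(statistics.pstdev(real), 2))
print("radial scatter, redshift space :", round(statistics.pstdev(red), 2))
```

Since the transverse positions are unaffected, a radial scatter several times larger than the real-space one turns a roughly spherical cluster into a structure stretched along the line of sight.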

2.4. Segregation 

The point field formed by the galaxies in the surveyed volume is clearly a marked 
point field, in the sense that qualitative marks like morphological type and quantitative 
marks like intrinsic luminosity distinguish different objects within the same catalogue. 
The statistical properties of the spatial distributions of different kinds of objects can 
be different. In fact, it is well established that elliptical galaxies are more frequent in 
denser regions such as rich clusters, while spirals are more often found in low-density 
environments [?]. Statistical descriptors such as the two-point correlation function or 
multifractal measures might give different results when they are applied to 
different categories of galaxies. Although less evident, it also seems established that a 
certain degree of luminosity segregation exists, at least for galaxies with absolute 
magnitude $M_B < -20 + 5\log h$. Bright galaxies are more strongly correlated than faint 
galaxies. It seems that both kinds of segregation exist but are independent effects. 
Mass segregation will be reviewed in Dr. Campos' lecture during this School. The 
segregation mechanisms are to be understood on the basis of a convincing structure 
formation theory. 



2.5. Ergodic hypothesis 

Obviously all the statistical measures will be applied to a portion of the Universe. Let 
$D$ be the size of a portion and $L$ be the scale to which the measure refers. If $L$ is not 
much smaller than $D$, and we apply the same measure to another portion of size $D$, 
we expect to find different results. This is sometimes referred to as 'cosmic variance'. 
If $D \gg L$, we shall have many realizations of the probability distribution within our 
sample, and therefore we expect that the results do not depend too much upon the 
studied region. Statisticians would say that we are assuming ergodicity [10], [11], in 
the sense that our sample is enough to obtain statistically reliable results [12], as it 
contains an adequate number of independent realizations. 



3. Statistical measures 



We have summarized in the previous section how the observation of the Universe 
at large scales provides us with obscured, truncated, distorted and segregated samples 
of galaxies. In spite of all their shortcomings, they are of extraordinary interest in 
Cosmology. From the detailed analysis of these samples we learn about the past and 
future of the Universe. 

In this section, we will introduce some of the mathematical techniques often used 
to statistically describe the distribution of galaxies. 



3.1. Second-order characteristics 

I will start this section by using the terminology and notation often employed by 
spatial statisticians and I will relate it to that used by cosmologists. 

In a point process, the galaxy distribution in our case, we can define [11], [13] 
the second-order intensity function $\lambda_2(x_1, x_2)$ as follows: let us consider two 
infinitesimally small spheres centered at $x_1$ and $x_2$ with volumes $dV_1$ and $dV_2$. The joint 
probability that in each of the spheres lies a point of the point process is approximately 

$$dP = \lambda_2(x_1, x_2)\,dV_1\,dV_2. \qquad (6)$$

(See [?], [13] for an exact definition.) If the point field is homogeneous (sometimes called 
stationary), $\lambda_2(x_1, x_2)$ depends only on the distance $r = |x_1 - x_2|$ and the direction 
of the line passing through $x_1$ and $x_2$, $0 \le \beta < \pi$: $\lambda_2(r, \beta)$. If, in addition, the 
process is isotropic, the angle $\beta$ becomes unimportant and the function depends only 
on $r$, $\lambda_2(r)$. In the following, we will assume the Cosmological Principle: the Universe 
at large scales is homogeneous and isotropic. Let $n$ be the mean number density of 
galaxies in a huge volume, assumed to be a fair sample of the Universe. The two-point 
correlation function commonly used in Cosmology is [10] 

$$\xi(r) = \frac{\lambda_2(r)}{n^2} - 1. \qquad (7)$$

The expected number of points within a distance $r$ from an arbitrary given galaxy 
is 

$$\langle N \rangle_r = \int_0^r 4\pi n s^2 \left(1 + \xi(s)\right) ds = \frac{4\pi}{n} \int_0^r s^2 \lambda_2(s)\,ds. \qquad (8)$$

The last expression may also be referred to as a correlation integral $C(r)$ [?]. 
$K(r) = C(r)/n$ is called Ripley's $K$-function [?] and is extensively used in the 
literature of point fields. In this context $n$ is the first-order intensity function $\lambda$ of the 
point field, which is constant for homogeneous processes. 

3.2. Estimators of $\xi(r)$ 

Different estimators may be used to evaluate $\xi(r)$. For volume-limited samples all 
the galaxies have the same weight $w = 1$, while when selection functions are used to 
account for the incompleteness, each galaxy is counted with a weight $w \ge 1$. Davis & 
Peebles [?] use the estimator 

$$\xi(r) = \frac{DD(r)}{DR(r)}\,\frac{N_R}{N_D} - 1, \qquad (9)$$

where $DD(r)$ is the number of pairs with separation $r$ in the galaxy catalogue with 
$N_D$ galaxies and $DR(r)$ is the number of pairs with separation $r$ between the data 
and a randomly distributed sample with $N_R$ points. Equivalently we can use [?], [?] the 
following estimator by averaging over the $N$ galaxies of the sample 

$$\xi(r) = \frac{1}{N} \sum_{i=1}^{N} \frac{N_i(r)}{n V_i(r)} - 1, \qquad (10)$$

where $N_i(r)$ is the number of galaxies lying in a shell of thickness $dr$ at distance $r$ 
from galaxy $i$ and $V_i(r)$ is the volume of the shell lying within the sample boundaries. 
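
A brute-force version of the pair-count estimator of Davis & Peebles can be sketched as follows for a volume-limited sample (all weights equal to 1). Two conventions are assumptions of this sketch rather than statements of the text: $DD$ counts ordered pairs, so each galaxy pair enters twice, and the random catalogue is generated inside the same volume as the data, here simply the unit cube. The helper names are hypothetical.

```python
import math
import random


def count_pairs(a, b, r_min, r_max, same_set=False):
    """Ordered pairs (p in a, q in b) with r_min <= |p - q| < r_max."""
    n_pairs = 0
    for i, p in enumerate(a):
        for j, q in enumerate(b):
            if same_set and i == j:
                continue  # a point is not paired with itself
            if r_min <= math.dist(p, q) < r_max:
                n_pairs += 1
    return n_pairs


def xi_davis_peebles(data, randoms, r_min, r_max):
    """xi = (N_R / N_D) * DD / DR - 1 in one separation bin."""
    dd = count_pairs(data, data, r_min, r_max, same_set=True)
    dr = count_pairs(data, randoms, r_min, r_max)
    if dr == 0:
        raise ValueError("no data-random pairs in this bin; use more random points")
    return (len(randoms) / len(data)) * dd / dr - 1.0


def uniform_points(n, rng):
    """n points drawn uniformly in the unit cube."""
    return [(rng.random(), rng.random(), rng.random()) for _ in range(n)]


# Sanity check: an unclustered (Poisson) sample should give xi close to 0.
rng = random.Random(42)
data = uniform_points(200, rng)
randoms = uniform_points(400, rng)
xi_poisson = xi_davis_peebles(data, randoms, 0.2, 0.4)
print("xi for a Poisson sample in 0.2 <= r < 0.4:", round(xi_poisson, 3))
```

Because the random points share the geometry of the data, the edge effects of the sample boundaries largely cancel in the ratio $DD/DR$. The double loop scales as the square of the number of points, so real catalogues call for a spatial index such as a grid or a tree.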

There is enough confidence in the power-law behaviour of the observed galaxy 
two-point correlation function in the range of scales $0.1 < r < 10\,h^{-1}$ Mpc [?], [?]: 

$$\xi_{gg}(r) = \left(\frac{r}{r_g}\right)^{-\gamma}, \qquad (11)$$

where the exponent $\gamma \simeq 1.8$ and the correlation length $r_g \simeq 5\,h^{-1}$ Mpc. It is interesting 
to note that the correlation function of clusters of galaxies is compatible with (11) once 
we replace $r_g$ by $r_c \simeq 15$-$30\,h^{-1}$ Mpc. The value of $r_c$ is rather controversial due to 
possible selection effects in the compilation of the catalogues of galaxy clusters [?], 
but in any case it exceeds $r_g$ by a significant factor. 
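
To put a number on this factor, one may assume, as stated above, that the cluster correlation function follows (11) with the same exponent $\gamma \simeq 1.8$ and only the correlation length changed. The ratio of the two amplitudes at a fixed separation then follows directly; the figures below simply insert the quoted ranges and are only indicative.

```latex
\frac{\xi_{cc}(r)}{\xi_{gg}(r)}
  = \frac{(r/r_c)^{-\gamma}}{(r/r_g)^{-\gamma}}
  = \left(\frac{r_c}{r_g}\right)^{\gamma},
\qquad
\left(\frac{15}{5}\right)^{1.8} \approx 7.2,
\qquad
\left(\frac{30}{5}\right)^{1.8} \approx 25.
```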

3.3. Moments of the cell-counts and scaling 

Let us center a sphere of radius $r$ on a galaxy (labeled by $i$) and call $n_i(r)$ the number 
of galaxies in it excluding the central one. Averaging over the $N$ galaxies of the sample 
we get the mean count 

$$\langle N \rangle_r = \frac{1}{N} \sum_{i=1}^{N} n_i(r), \qquad (12)$$

which provides the correlation integral (8). We shall say that there is scaling for the 
first moment of the counts of neighbors if 

$$\langle N \rangle_r \propto r^{D_2}, \qquad (13)$$

and the constant exponent $D_2$ is known as the correlation dimension [20]. 

In the range of scales where $\xi(r) \gg 1$, equations (8), (11) and (13) allow us to derive 
a relation between the exponent of the two-point correlation function $\gamma$ and $D_2$: 

$$D_2 \simeq 3 - \gamma. \qquad (14)$$

Obviously, in the regime where $\xi(r)$ is of the order of unity, the previous relation does 
not hold. 
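
The origin of relation (14), and the way it breaks down, can be checked numerically. For the power law (11), the integral (8) can be done in closed form: $\langle N \rangle_r = 4\pi n \left[ r^3/3 + r_g^{\gamma} r^{3-\gamma}/(3-\gamma) \right]$ for $\gamma < 3$. The sketch below evaluates the local logarithmic slope of this expression; the default values $r_g = 5$ and $\gamma = 1.8$ are those quoted in the text, while the function names and the unit density are arbitrary choices.

```python
import math


def mean_count(r, n=1.0, r_g=5.0, gamma=1.8):
    """Eq. (8) integrated analytically for xi(s) = (s / r_g)^(-gamma), gamma < 3."""
    poisson_term = r ** 3 / 3.0
    clustering_term = r_g ** gamma * r ** (3.0 - gamma) / (3.0 - gamma)
    return 4.0 * math.pi * n * (poisson_term + clustering_term)


def local_slope(r, eps=1e-4):
    """d ln<N> / d ln r at radius r, from a centred finite difference."""
    upper = math.log(mean_count(r * (1.0 + eps)))
    lower = math.log(mean_count(r * (1.0 - eps)))
    return (upper - lower) / (math.log(1.0 + eps) - math.log(1.0 - eps))


# Slope in the strongly clustered regime, near xi ~ 1, and on very large scales.
for r in (0.05, 5.0, 5000.0):
    print(f"r = {r:8.2f} h^-1 Mpc  ->  local slope = {local_slope(r):.3f}")
```

For $r \ll r_g$ the slope tends to $3 - \gamma = 1.2$, at $r = r_g$ it takes an intermediate value, and for $r \gg r_g$ it approaches 3, the value for a homogeneous distribution.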

Scaling may be generalized to moments of any order if 

$$Z(q, r) \equiv \frac{1}{N} \sum_{i=1}^{N} n_i(r)^{q-1} \propto r^{\tau(q)}, \qquad (15)$$

with scaling indices $\tau(q)$ independent of $r$ in a suitable interval. There we define the 
generalized dimensions $D_q = \tau(q)/(q - 1)$. For $q = 2$, we recover the scaling of (13), 
where $\tau(2) = D_2$. When the scaling relation (15) holds, we will say that the point 
distribution has multifractal character. In a simple fractal $D_q = {\rm const}$ for all $q$ values, 
while for a multifractal set $D_q$ is a decreasing function of $q$. The meaning of $D_q$ is 
clear: when $q$ is positive and large, the denser parts of the point distribution dominate 
the sums in (15), while for negative values of $q$ the sums are dominated by the rarefied 
regions of the point set. 
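
A direct, brute-force way of estimating $D_q$ from the moments of the neighbor counts might look as follows. This is a sketch under simplifying assumptions: no correction is made for the sample boundaries, spheres with no neighbors are skipped so that negative powers remain finite, and $\tau(q)$ is obtained from an unweighted least-squares fit in log-log space. The test set, equally spaced points on a straight line, is a hypothetical example for which every $D_q$ should be close to 1.

```python
import math


def neighbour_counts(points, r):
    """n_i(r): number of other points within distance r of each point i."""
    counts = []
    for i, p in enumerate(points):
        counts.append(sum(1 for j, q in enumerate(points)
                          if j != i and math.dist(p, q) <= r))
    return counts


def partition_sum(counts, q):
    """(1/N) * sum of n_i^(q-1); centres with n_i = 0 are skipped."""
    return sum(c ** (q - 1) for c in counts if c > 0) / len(counts)


def fit_slope(xs, ys):
    """Least-squares slope of ys against xs."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


def generalized_dimensions(points, radii, qs):
    """D_q = tau(q) / (q - 1), with tau(q) the log-log slope of the moment sum."""
    log_r = [math.log(r) for r in radii]
    counts = [neighbour_counts(points, r) for r in radii]
    dims = {}
    for q in qs:
        if q == 1:
            raise ValueError("q = 1 requires a separate limit and is not handled here")
        log_z = [math.log(partition_sum(c, q)) for c in counts]
        dims[q] = fit_slope(log_r, log_z) / (q - 1)
    return dims


# Hypothetical test set: 500 equally spaced points on a line segment.
line = [(0.002 * i, 0.0, 0.0) for i in range(500)]
dims = generalized_dimensions(line, [0.011, 0.021, 0.041, 0.081], [2, 3])
print({q: round(d, 3) for q, d in dims.items()})
```

The small departures from 1 in this test come from the finite length of the segment and from the discreteness of the counts, which is a reminder that boundary effects matter in real samples.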

For $q < 2$, it is usually more convenient to obtain $D_q$ through a different algorithm. 
Let us call $r_i(n)$ the radius of the smallest sphere centered at point $i$ and enclosing 
$n$ neighbors; in other words, $r_i(n)$ is the distance from point $i$ to its $n$th neighbor. The 
exponents $\tau(q)$ are obtained through the relation 

$$W(\tau, n) = \frac{1}{N} \sum_{i=1}^{N} r_i(n)^{-\tau} \propto n^{1-q}. \qquad (16)$$



For $q = 0$, the previous equation provides a way of estimating the Hausdorff 
dimension $D_0$. Galaxy samples such as the CfA-I provide a value of $D_0 \simeq 2.1$, while 
the correlation dimension of the same sample is $D_2 \simeq 1.3$ [21] in the range of scales 
$1 < r < 10\,h^{-1}$ Mpc. The whole $D_q$ function of a volume-limited subsample of 
the CfA-I catalogue is illustrated in Fig. 2 (solid line). Equation (15) has been used 
for $q > 2$, while Equation (16) has been used for $q < 2$. 
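
The nearest-neighbor algorithm of Eq. (16) can be sketched in the same brute-force spirit. Here $\tau$ is fixed, the log-log slope of $W(\tau, n)$ against $n$ gives $1 - q$, and the dimension follows as $D_q = \tau/(q - 1)$. Boundary corrections are again ignored and the helper names are hypothetical. With $\tau = -1$ and the same straight-line test set, one expects $q \approx 0$ and $D_0 \approx 1$.

```python
import math


def sorted_neighbour_distances(points):
    """For each point, the sorted list of distances to all other points."""
    table = []
    for i, p in enumerate(points):
        table.append(sorted(math.dist(p, q)
                            for j, q in enumerate(points) if j != i))
    return table


def w_sum(table, n, tau):
    """W(tau, n): mean of r_i(n)^(-tau), r_i(n) being the n-th neighbour distance."""
    return sum(dists[n - 1] ** (-tau) for dists in table) / len(table)


def fit_slope(xs, ys):
    """Least-squares slope of ys against xs."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


def dimension_from_tau(points, ns, tau):
    """Return (q, D_q) for a given tau, using W(tau, n) ~ n^(1 - q)."""
    table = sorted_neighbour_distances(points)
    log_n = [math.log(n) for n in ns]
    log_w = [math.log(w_sum(table, n, tau)) for n in ns]
    q = 1.0 - fit_slope(log_n, log_w)
    return q, tau / (q - 1.0)


# Hypothetical test set: 500 equally spaced points on a line segment.
line = [(0.002 * i, 0.0, 0.0) for i in range(500)]
q_est, d_est = dimension_from_tau(line, [4, 8, 16, 32], tau=-1.0)
print("q =", round(q_est, 3), " D_q =", round(d_est, 3))
```

Working at fixed neighbor number rather than fixed radius means that rarefied regions are probed with large spheres, which is why this route tends to be better behaved for small and negative $q$.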

3.4. Multiscaling 

Beyond $10\,h^{-1}$ Mpc it is much more difficult to estimate the galaxy-galaxy two-point 
correlation function. Nevertheless, integral quantities such as $C(r)$ can still 
be estimated with enough reliability. For $r > 10\,h^{-1}$ Mpc, there is also information 
about the statistics of the distribution of clusters of galaxies. In this section, we will 
provide an explanation of the clustering of different objects such as galaxies or 
clusters within the same theoretical framework. 

If the matter distribution is considered a continuous density field, we could think 
of galaxies as being the peaks of the field above some given threshold. A larger 
threshold will correspond to clusters of galaxies. The higher the threshold, the richer 
the galaxy cluster. We can use the stochastic model shown in Fig. 3(a) to illustrate 
this behaviour (for details see [22], [23]). By applying different density thresholds, which 
are quite naturally defined in this model, we obtain the distributions shown in Figs. 
3(b), (c). The parameters of the model were chosen to provide values of $D_0 = 2.0$ and 
$D_2 = 1.3$. The whole $D_q$ function of this model is illustrated in Fig. 2 (dashed 
line). 

The value of $D_2$ is approximately the slope of the log-log plot of $Z(2, r)$ vs. $r$ shown 
as a solid line in Fig. 4. After applying the threshold, $Z(2, r)$ still follows a power 
law, but with a different slope (see dotted and dashed lines in the figure). In the plot we 
can see that the higher the density threshold, the lower $D_2$ is. Multiscaling is a scaling 
law where the exponent varies slowly with the length scale due to the presence of 
a threshold density defining the objects [24]. 

We have seen that the observed matter distribution in the Universe follows some 
sort of multiscaling behaviour [14]. If galaxies and clusters of galaxies with increasing 
richness are considered as different realizations of the selection of a density threshold 
in the mass distribution, the multiscaling argument implies that the corresponding 
values of the correlation dimension $D_2$ must decrease with increasing density. 

We shall show the correlation integral for galaxy samples and for cluster samples in 
the range [10, 50] $h^{-1}$ Mpc. In this range of scales $\xi_{gg}(r)$ does not follow a power-law 
shape, while $C(r)$ is nicely fitted by a power law. For galaxies we have analyzed 
the CfA-I sample, the Pisces-Perseus sample [25] and the QDOT-IRAS redshift survey 
[?]. The cluster samples are the Abell and ACO catalogues [26], the Edinburgh-Durham 
redshift survey [27], the ROSAT X-ray-selected cluster sample [28] and the 
APM cluster catalogue [29]. 

In Fig. 5 we see that three straight lines fit the eight samples analyzed 
reasonably well. All the cluster samples have a correlation integral well fitted by a power 
law with exponent $D_2 \simeq 2.1$. A value of $D_2 \simeq 2.5$ appears for the optical galaxy 
catalogues: the CfA volume-limited sample and the Pisces-Perseus survey. Finally, a 
value of $D_2 \simeq 2.8$ is obtained for the QDOT-IRAS galaxies. These results support the 
multiscaling behaviour of the matter distribution in the Universe. 

The fact that $D_2$ for IRAS galaxies is larger than for optical samples indicates 
that IRAS galaxies are less correlated than optical galaxies; this is nicely interpreted 
if optical galaxies correspond to higher peaks of the density field. Clusters of galaxies 
have stronger correlations than galaxies, corresponding to the highest peaks of the 
background matter density. 

4. Conclusions 

We have given a short introduction to different aspects of the characterization of 
the observed surveys of galaxy clustering by means of different statistical techniques. 
Redshift surveys, when considered as point processes, have peculiar features which 
can be expressed through statistical tools. Obscuration by dust in our own Galaxy, 
truncation in luminosity and the use of selection functions for flux-limited samples have 
been discussed in some detail. The analysis of clumpiness is often done in redshift 
space, which has important distortions when compared to real space. Morphological 
and luminosity segregation is an important clue for testing galaxy formation theories. 
It is interesting to consider the galaxy distribution as a marked point process. We have 
illustrated the relationship of the two-point correlation function to other statistical 
quantities such as the intensity functions or cumulant quantities such as the correlation 
integral. The multifractal nature of the matter distribution comes from the scaling 
of the moments of the cell-counts. We have introduced the concept of multiscaling 
to provide a neat scheme for the explanation of the clustering of galaxies of different 
kinds and clusters with different richness. In this context, we have shown how the 
correlation dimension $D_2$ attains specific values for each kind of cosmic object, being 
a clear measure of their clustering. 

Figure captions 

Figure 1. The selection function of the QDOT-IRAS redshift survey. 

Figure 2. The generalized dimensions $D_q$ for a volume-limited subsample of the 
CfA-I catalogue (solid line) and for the multifractal model shown in Fig. 3 (dashed line). 

Figure 3. A multifractal stochastic model for the galaxy distribution (a). In (b) 
and (c) we see the same model after applying increasing density thresholds. 

Figure 4. The function $Z(2, r) = C(r)/n$ for the model shown in Fig. 3. The 
slope $D_2$ is lower for samples with higher density threshold. 

Figure 5. The correlation integral for different galaxy and cluster samples 
(reproduced with permission from Martinez et al. Science 269, 1245. Copyright 
1995 American Association for the Advancement of Science). 



